Apparatus and method for collective branching in a multiple instruction stream multiprocessor.



A Multiple Instruction Stream Multiple Data Stream (MIMD)
parallel processing apparatus and compiling method for
effectuating collective branching of execution by the
processors includes specialized branch and fuzzy barrier
units which operate with respect to special instructions
scheduled in unshaded regions of the instruction streams of
the processors involved in a collective branch. A special
compare instruction is scheduled in a first unshaded region
of only one of the processors while a special jump
instruction is scheduled in the next unshaded region of the
instruction stream of the other involved processors. By the
special jump instruction, the other processors use the
special compare result which is simultaneously passed to
each of them by the branch unit for determining the
execution branch. The barrier unit provides fuzzy barrier
synchronization assuring that the correct compare result is
used in this determination.



RELATED APPLICATION 



This application is related to our co-pending 
U.S. Patent Application, Serial No. 227,276, entitled 
"METHOD AND APPARATUS FOR SYNCHRONIZ- 
ING PARALLEL PROCESSORS USING A FUZZY 
BARRIER", which was filed on August 2, 1988. 
Said application is hereby incorporated herein by 
reference. 



BACKGROUND OF THE INVENTION 



The present invention relates generally to mul- 
tiprocessor apparatus having parallel processors for 
execution of parallel related instruction streams and 
the compiling method for generating the streams. 
In its particular aspects, the present invention re-
lates to apparatus and method for collective
branching of execution by the processors. 



2. Description of the Prior Art

Multiprocessor techniques for the exploitation of
instruction level parallelism for achieving increased
processing speed over that obtainable with a uniprocessor
are known for VERY LONG INSTRUCTION WORD (VLIW) machine
architecture. A theoretical VLIW machine consists of
multiple parallel processors that operate in lockstep,
executing instructions fetched from a single stream of long
instructions, each long instruction consisting of a
possibly different individual instruction for each
processor. A run-time delay in the completion by any one
processor of its individual instruction, due to unavoidable
events such as memory access conflicts, delays the issuance
of the entire next long instruction for all processors.

Known MULTIPLE INSTRUCTION STREAM MULTIPLE DATA STREAM
(MIMD) architecture enables processors to operate
independently when the long instructions are partitioned
into separate or multiple streams. By independence, we mean
that a run-time delay in one stream need not immediately
delay execution of the other streams. Such independence,
however, cannot be complete since a mechanism must be
provided to enable the processors to periodically
synchronize at barrier points in the instruction streams.
In the prior art, such barrier points have been fixed and
processors have been equipped for issuing an "I GOT HERE"
flag to a barrier coordinating unit when a barrier is
reached by the processor, which then stalls or idles until
receipt from said unit of a "GO" instruction issued when
all processors have issued their "I GOT HERE" flag.
Illustrative are U.S. Patent Nos. 4,344,134; 4,365,292; and
4,212,303 to Barnes individually or with others.

Heretofore, lockstep operation of parallel processors has
been necessary for the execution of a branching
instruction, i.e. the evaluation of a condition, testing
the resulting value, and executing a branch, such as
selectively jumping to an instruction address based upon or
defined by the result of the test. The lockstep operation
ensures that the various processors take the same branch or
jump. While collective branching is important to fully
exploit instruction level parallelism, the prior art has
generally restricted multiple instruction stream
architecture to execution of programs with no
data-dependent branching (see for example said U.S. Patent
No. 4,365,292 at column 3, lines 48-53).



SUMMARY OF THE INVENTION



In accordance with the principles of the invention, an
instruction for the evaluation and testing of a condition
to be used for collective branching by a plurality of
processors is scheduled on only one of the processors as a
"special compare" instruction. Means are provided coupling
the processors in a manner that the result of the special
compare instruction is made available to or passed to the
other processors involved in the collective branching. Each
other involved processor executes a "special jump"
instruction which utilizes the passed special compare
result for determining, when the special jump instruction
is executed, the location or address of the next
instruction in said processor's instruction stream. In the
instruction stream of the processor which evaluated the
special compare instruction, and subsequent to or
downstream from said instruction, there is generally
provided a "regular jump" instruction which uses the
compare result locally generated by that processor for
execution of its own jump instruction.

To assure that all processors involved in execution of
collective branching take the same execution branch, it is
necessary that the processors be synchronized. By the term
"synchronized" we mean merely that data or logical
dependences between the instruction streams are
sufficiently resolved to assure that each jump instruction
is evaluated based on the same special compare result.
Thus, a special jump cannot be executed until after a
special compare result is passed, and a new special compare
result cannot be passed until all processors involved in
the last collective branch have used the last special
compare result.

While fixed barriers may be interposed in the 
multiple instruction streams to force previously 
known "synchronizations" of the kind where processors
idle at a fixed barrier until the last processor
reaches the barrier, we have found that a "fuzzy
barrier" as described in our co-pending application 
aforementioned provides the necessary level of 
synchronization without the consequent delays of 
previously known fixed barriers. 

Thus, in compiling source code (typically 
originally written for serial execution on a
uniprocessor) in order to schedule instructions into
parallel streams for parallel processor execution,
the streams are divided into alternating corresponding
"shaded" and "unshaded" regions. "Shaded
regions" contain instructions which are data in- 
dependent from instructions in other streams (or 
are otherwise inherently synchronized in writing 
and reading shared resources so as to not require 
barrier synchronization) while unshaded regions 
contain instructions which require barrier synchro- 
nization principally because they use data gen- 
erated by another processor, typically in the imme- 
diately preceding unshaded region of the other 
processor's instruction stream. 

According to the principles of the invention, 
fuzzy barrier means are provided in which each 
processor has means for identifying shaded and 
unshaded regions in its instruction stream and 
means for receiving a "want_in" signal from each
other processor indicating when each other processor
wants to synchronize. The barrier means further
includes a state machine for selectively generating
a "want_out" signal indicating when the processor
wants to synchronize and a signal for selectively 
stalling or idling execution of the processor. 

In accordance with the further principles of the 
invention, the special compare instruction is sched- 
uled in a first unshaded region of the instruction 
stream of only one of the processors and the 
special jump instruction is scheduled in the next 
following or second unshaded region of the one or 
more other processors involved in the branch, 
there being a shaded region intermediate the first 
and second unshaded regions. The regular jump 
instruction in the stream of the one processor 
which evaluates the special compare instruction is 
scheduled in a second unshaded region which cor- 
responds to the second unshaded region in which 
the special jump is scheduled in the instruction 
streams of the other involved processors. 
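
By way of illustration only, the scheduling rule just stated can be
expressed as a small check over a description of the instruction
streams. The sketch below is not taken from the specification: the
function name, the region encoding ("U" for unshaded, "S" for shaded)
and the opcodes other than CMPSP, JMPSP and JMPR are assumptions made
for the example, and it covers only a single collective branch.

```python
def check_collective_branch(streams):
    """Check one collective branch laid out as unshaded / shaded / unshaded.

    streams maps a stream name to a list of (kind, opcodes) regions, where
    kind is "U" (unshaded) or "S" (shaded). This encoding is hypothetical.
    """
    # Every involved stream must have the same region pattern U, S, U.
    kinds = {tuple(kind for kind, _ in regions) for regions in streams.values()}
    if kinds != {("U", "S", "U")}:
        return False

    # The special compare sits in the first unshaded region of exactly one stream.
    comparers = [name for name, regions in streams.items()
                 if "CMPSP" in regions[0][1]]
    if len(comparers) != 1:
        return False

    # The comparing stream takes a regular jump, all others a special jump,
    # each in the second unshaded region.
    for name, regions in streams.items():
        expected = "JMPR" if name == comparers[0] else "JMPSP"
        if expected not in regions[2][1]:
            return False
    return True


# Streams resembling S1-S3; ADD, LOAD, MUL and SUB are placeholder opcodes.
figure3_like = {
    "S1": [("U", ["CMPSP"]), ("S", ["ADD"]), ("U", ["JMPR"])],
    "S2": [("U", ["LOAD"]), ("S", ["MUL"]), ("U", ["JMPSP"])],
    "S3": [("U", ["LOAD"]), ("S", ["SUB"]), ("U", ["JMPSP"])],
}
print(check_collective_branch(figure3_like))
```

A layout with a special compare in two streams, or with a special jump
in the comparing stream, is rejected by this check.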



The fuzzy barrier means assures synchronization of the type
that processors reaching the end of corresponding shaded
regions, before synchronization, will stall until the last
processor has at least entered its corresponding shaded
region. Upon such entry of the last processor,
synchronization takes place and processors are free to pass
to the next following unshaded region and the next
following shaded region, where, upon entry, they
individually issue a want_out signal directed to the other
processors indicating a desire to synchronize. In
contradistinction to prior art fixed barrier methods, the
processors continue to execute instructions in the shaded
region while waiting for synchronization. If
synchronization has not yet occurred when the end of the
shaded region is reached by a processor, it will stall
while continuing to issue the want_out signal.
Synchronization will occur when all processors issue a
want_out signal, which synchronization will generally cause
the want_out signals to be reset. It is thus apparent that
scheduling the special compare instruction in an unshaded
region of one processor's instruction stream, and
scheduling the regular jump and special jump instructions
in the next following corresponding unshaded regions of the
instruction streams, will not allow any jump to be executed
until the special compare has been first evaluated and will
not allow a special compare to be evaluated unless the
immediately preceding unshaded region has been passed by
all processors.

In order for barrier synchronization to achieve this
result, it is generally necessary that upon evaluation of
the special compare result by one processor, the result is
substantially simultaneously passed to all processors
involved in the collective branch. Accordingly, data
exchange means are provided especially for passing a
special compare result word much more rapidly than possible
by writing and reading shared memory or shared register
channels, which data exchange means further obviate the
possibility of conflicts in or blocking on these shared
resources.

As explained in more detail in our co-pending application
in relation to the fuzzy barrier, since the number of
instructions in the shaded regions of the instruction
streams (for example that shaded region intermediate the
first and second unshaded regions) provides a cushion for
non-idling instruction execution while awaiting
synchronization, the instructions should be compiled in
accordance with a method in which instructions are
scheduled in the shaded regions to maximize the size of or
number of instructions in such regions. Accordingly, in the
compiler method according to our co-pending application as
applied to the collective branching instructions, serial
instruction code is separated into separate parallel
streams in which corresponding unshaded and shaded regions
have been identified and with the special compare
instruction in the unshaded region of one stream and the
special jump instructions in the next following unshaded
region of the other instruction streams involved in the
collective branching. Furthermore, the stream instructions
are thereafter reordered to move into the shaded region
intermediate the aforementioned unshaded regions those
instructions which, prior to reordering, are downstream
from said shaded region but may be executed earlier without
compromising downstream data dependencies.



BRIEF DESCRIPTION OF THE DRAWING 



Other objects, features and advantages of the 
present invention will become apparent upon pe- 
rusal of the following detailed description of the 
preferred embodiments when taken in conjunction 
with the appended drawing wherein: 

Figure 1 is a block diagram of a multiproces- 
sor according to the present invention, each par- 
allel processor thereof including a barrier unit and a 
branch unit; 

Figure 2 is a block diagram of a branch unit
in Figure 1;

Figure 3 is a representation of corresponding
instruction streams for the parallel processors in
Figure 1;

Figure 4 is a block diagram of the barrier
unit in Figure 1 including a state machine;

Figure 5 is a state diagram for the state
machine in Figure 4; and

Figure 6 is a flow chart for the compiling
method in accordance with the invention.



DETAILED DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 



Referring first to Figure 1 of the drawing, there is
schematically illustrated a multiprocessor apparatus 10 in
accordance with the invention, which comprises an array of
processors in Multiple Instruction Stream Multiple Data
Stream (MIMD) configuration. Four processors are shown
designated P1, P2, P3, and P4, said number being chosen
both for the purposes of illustration and as a number of
processors which may be feasibly integrated together on a
single chip. Each of processors P1-P4 is preferably both
identical and symmetrically connected in the multiprocessor
10 to allow for flexibility for scheduling operations in
individual instruction streams for processing in parallel
by the processors P1-P4.



Processors P1, P2, P3 and P4 respectively have their own
dedicated instruction memories I1, I2, I3 and I4 for
sequential instructions forming the respective instruction
streams. There are also shared memory and register channel
resources 12 which have address input lines 14 and
bi-directional data lines 16 to the respective processors
P1-P4. The memory 12 is shared in that any memory location
can be written or read by any processor utilizing a
suitable cross bar (not shown) forming a part of memory 12.
Instruction memories I1-I4 may operate as caches for
instruction sequences fetched, when needed, from shared
memory 12 via lines 18.

A limited number of register channels preferably included
in these shared resources provide means for more rapidly
passing data between processors than can be obtained using
the shared memory portion of shared resources 12. Each
register channel may be written and read by the processors
in the nature of a register, and also has an associated
communication protocol or synchronizing bit, in the nature
of a channel, which indicates, for example, whether the
register channel is full or empty. Such register channels
provide inherent synchronizations between processors
because a processor cannot read a register channel until
after it is written by another processor. While such
register channels could conceivably be used to pass data in
a synchronized fashion between processors for accomplishing
a collective branching instruction, the limited number of
such register channels and the delays caused by strategies
to avoid blocking on said register channels call for a more
aggressive solution utilizing specialized hardware for both
collective branching and the synchronization thereof.
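
The full/empty behaviour of a register channel may be pictured, purely
as a sketch, in the following way. The class and method names are
hypothetical, and the choices that a read empties the channel and that
a write to a full channel is refused are assumptions; the text above
only states that a channel cannot be read until after it is written.

```python
class RegisterChannel:
    """One register channel with a full/empty synchronizing bit (illustrative)."""

    def __init__(self):
        self.full = False   # synchronizing bit: True once a value has been written
        self.value = None

    def try_write(self, value):
        """Write the channel; refuse (return False) if it is still full."""
        if self.full:
            return False    # assumed policy: the writer would block or retry
        self.value = value
        self.full = True
        return True

    def try_read(self):
        """Return (True, value) if full, else (False, None); a read empties it."""
        if not self.full:
            return False, None   # the reader must wait for a writer
        self.full = False
        return True, self.value


channel = RegisterChannel()
print(channel.try_read())    # nothing written yet, so the read cannot proceed
channel.try_write(7)
print(channel.try_read())
```

The refusal paths in this sketch correspond to the blocking that the
specialized branch unit is intended to avoid.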

The processors P1-P4 each comprise an execution means 20
consisting of a control unit 22 which is coupled to an
arithmetic and logical unit (ALU) 24. Control units 22 of
the respective processors P1-P4 receive instructions from
the respective instruction memories I1-I4 on lines 26 and
issue addresses to the respective instruction memories as
well as to the shared memory and register channels on lines
14. The data lines 16 are bi-directionally coupled to the
ALU 24 of each processor.

The processors P1-P4 each further comprise specialized
hardware in the form of a barrier unit 28 and a branch unit
30. Barrier unit 28 is for coordinating synchronization of
the processors P1-P4 with respect to fuzzy barriers
established in the instruction streams of the processors as
described in our aforementioned co-pending patent
application. Branch unit 30 is for providing a collective
branching result word to processors P1-P4, or a sub-group
of them involved in collective branching parallel
instructions, with the aid of the fuzzy barrier
synchronization provided by barrier unit 28. Each of branch
units 30 has output lines "Oi" which form inputs to the
branch units 30 of the other processors and receives
similar inputs from each of the other processors. Barrier
units 28 are similarly connected to each other, which
connections are represented schematically as the
bi-directional coupling 32 in Figure 1.

The nature and purpose of both the branch unit 
30 and the barrier unit 28 as well as the interaction 
between units 28 and 30 of each processor with 
units 28 and 30 of the other processors and the 
interactions within each processor between such 
units and execution means 20 are best understood
by reference to Figures 3 and 6. 

Figure 3 shows streams S1-S4 of illustrative 
instructions for the respective processors P1-P4 in 
which execution progresses downwardly. For the 
purpose of example, streams S1, S2 and S3 con- 
tain related instructions so that processors P1, P2 
and P3 are involved in a sequence of two collective 
branches while stream S4 contains unrelated 
instructions for execution by processor P4. Under 
such circumstances, processor P4, not being in- 
volved in the collective branch, is "masked out" by 
barrier unit 28 in a manner which will be later 
discussed. 

The instructions contained in the streams are 
preferably derived by a compiling process which is 
illustrated by the flow chart in Figure 6. Therein, 
compiling begins with the source code 34 and in 
step 36 there is derived a parallel stream code 
using techniques for VERY LONG INSTRUCTION 
WORD (VLIW) compiling such as taught by R.P.
Colwell et al., "A VLIW Architecture For A Trace
Scheduling Compiler", Proc. Second International
Conf. On Architectural Support For Programming
Languages And Operating Systems, pp. 180-182, 
1987. Furthermore, in accordance with the princi- 
ples of the invention, the stream code is developed 
in step 36 so that the testing and evaluation of a 
branching condition is scheduled on only one of 
the processors as a "special compare" (CMPSP) 
instruction (such as in S1, Figure 3) and the subse- 
quent related "special jump" (JMPSP) instruction is 
scheduled on all other processors involved in the 
collective branch (such as in region 42 in S2 and 
S3 in Figure 3) which other processors receive the 
results of the special compare performed by P1 via 
the branch unit 30. Processors P2 and P3 use the 
CMPSP result received to take the execution 
branch indicated thereby (i.e. determine the ad- 
dress of the next instruction). 

In the stream S1, corresponding to the scheduling of JMPSP
in streams S2 and S3, there is scheduled a "regular jump"
(JMPR) instruction by which processor P1 is instructed to
use the locally generated CMPSP result to determine which
branch it takes. Thus, each of processors P1-P3, involved
in the collective branching, is to take the same branch,
but not necessarily at the same instant. Because each of
processors P1-P4 makes variable progress in executing
instructions in the respective streams S1-S4, barrier
synchronization is necessary to ensure that a JMPSP is
executed only after the corresponding CMPSP result has been
posted to the branch unit 30 and that the next CMPSP result
is not posted to branch unit 30 until the last CMPSP result
has been used by all involved processors.

This barrier synchronization is provided by barrier unit 28
which operates with respect to identified "shaded" and
"unshaded" regions in the instruction streams. Accordingly,
step 38 is performed after step 36 wherein corresponding
shaded and unshaded regions are established in the
respective instruction streams. As indicated in our
co-pending application, this can be done by having
instructions include a bit which describes the region in
which said instructions lie or by scheduling region
boundary instructions in the streams. Shaded regions
contain only instructions which are independent and/or do
not require barrier synchronization because they are
otherwise synchronized, as by the register channels
contained in shared resources 12. Unshaded regions contain
instructions which are dependent on mathematical or logical
data generated by another processor, which will not
necessarily be available when needed without barrier
synchronization.

The nature of the fuzzy barrier established with respect to
these shaded and unshaded regions is that the involved
processors will be said to be synchronized only when each
has at least reached its corresponding shaded region. The
processors continue to execute instructions in the shaded
regions while waiting to synchronize. If synchronization
has not occurred when a processor reaches the end of a
shaded region, the processor stalls or idles waiting for
the last other processor to reach the shaded region. When
the last processor reaches such region, synchronization
takes place, and all processors are now permitted to
execute instructions in the next following unshaded region
and pass thereafter into the next following region where
synchronization is similarly sought. Consequently, with
fuzzy barrier synchronization, in view of the shaded region
intermediate sequential first and second unshaded regions,
the first unshaded region must be fully executed by all
involved processors before any processor executes an
instruction in the second unshaded region. Accordingly, a
first set of corresponding unshaded regions 40 is
established in streams S1, S2 and S3 (Figure 3) with the
CMPSP instruction scheduled in region 40 of stream S1 only.
The related JMPR and JMPSP instructions are scheduled in
second corresponding unshaded regions 42 in streams S1, S2
and S3 with the JMPR instruction in stream S1, and the
JMPSP instruction in streams S2 and S3. Intermediate the
unshaded regions 40 and 42 is a shaded region 44 in each of
streams S1, S2 and S3.
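
The stalling rule described above may be illustrated numerically. The
following sketch is an idealization with assumed names and made-up
times: it takes, for each processor, the time at which it enters a
shaded region and the time at which it would reach the end of that
region if never stalled, and compares the resulting idle time with
that of a fixed barrier placed at the point of entry.

```python
def fuzzy_barrier_stalls(enter, leave):
    """Idle time per processor under a fuzzy barrier.

    enter[k]: time processor k enters its shaded region.
    leave[k]: time processor k reaches the end of that region, absent stalls.
    """
    sync = max(enter)  # synchronization occurs when the last processor enters
    # A processor idles only if it runs out of shaded instructions before sync.
    return [max(0, sync - t) for t in leave]


def fixed_barrier_stalls(arrive):
    """Idle time per processor when every processor waits at a fixed point."""
    go = max(arrive)   # "GO" is issued when the last processor arrives
    return [go - t for t in arrive]


enter = [3, 5, 9]      # hypothetical entry times for three processors
leave = [8, 11, 14]    # hypothetical end-of-region times
print(fuzzy_barrier_stalls(enter, leave))
print(fixed_barrier_stalls(enter))
```

With these figures only the first processor idles under the fuzzy
barrier, and only for one time unit, because the instructions in its
shaded region absorb most of the wait.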

Further in accordance with the compiling meth- 
od, the number of instructions in shaded regions 
such as 44 is maximized in step 46 by reordering
the stream code to move those later occurring 
instructions which can be safely executed earlier 
without compromising data dependencies, into the 
earlier occurring shaded region. This increases the 
cushion for non-idling synchronization between the 
involved processors. The reordering method is 
more fully described in our co-pending application. 
As a final step 48, the reordered stream code is 
assembled. 

In accordance with the principles of the present 
invention, in order to simplify the branch unit 30, 
only one set of processors can be involved during 
any shaded region in a collective branch. However, 
a different processor can be scheduled to execute 
a CMPSP instruction for a subsequent collective 
branch. Thus, there is further illustrated in Figure 3, 
a second collective branching by a shaded region 
50 and an unshaded region 52, next following, in 
which CMPSP is scheduled in stream S2. Regions 
50 and 52 establish fuzzy barrier synchronization 
such that each involved processor uses the 
CMPSP result posted by P1 prior to the next 
CMPSP being executed by P2. Operation with re- 
spect to the next following shaded region 54 and 
unshaded region 56 thereafter in which JMPR is 
scheduled in stream S2 and JMPSP is scheduled 
in streams S1 and S3 should be apparent from the 
first collective branching heretofore described.
While processor P4 is illustrated in Figure 3 as 
being "masked out" and operating throughout in an 
unrelated shaded region, it can synchronize and 
join in related processing when "masked in". 

It should be further understood that not all 
branching of the processors need be collective, 
there being instances where one or more proces- 
sors may branch individually within the context of a 
program. For such circumstances, a regular com- 
pare instruction (CMPR) is provided, the result of 
which is used locally by a subsequent JMPR in- 
struction and not passed to the other processors 
via branch unit 30. 

The operation of branch unit 30 will now be described with
reference to Figures 1 and 2. In the processor on which the
CMPSP instruction is scheduled, CMPSP will be evaluated in
ALU 24 and the multibit result word provided as a parallel
input "cc" to the branch unit 30. Further, a single bit
output B is provided by the control unit indicating, when
logical "1", the posting of the CMPSP result word to the
branch unit by said processor, which output acts as an
"enable" for said result word. Thus, the individual bits of
"cc" are separately gated by B in an array of AND gates 60
(one for each bit of "cc") producing a parallel first
output "scc_out" while "B" alone produces a single bit
second output "en_out", these first and second outputs
together comprising Oi.

The outputs scc_out and en_out from the other three
processors form three parallel inputs scc_in and three
single bit inputs en_in. Corresponding bits of the three
inputs scc_in are input to an array of OR gates 62 while
inputs en_in are input to OR gate 64, the output of which
forms the enable for a latch 68 for the parallel output of
the array of OR gates 62. Thus, latch 68 will contain a
special compare result evaluated by another of the
processors. A 2 to 1 multiplexer 70 receives alternative
parallel inputs from the outputs of latches 66 and 68 while
B provides a selection signal such that, in the processor
which evaluated CMPSP (where B goes to logical "one"), the
output of latch 66 is passed by multiplexer 70 to its
output scc, while in the processors which receive the CMPSP
result from another processor (where B remains at logical
"zero"), the output of latch 68 is passed by the
multiplexer 70 to output scc.
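
The data path of the branch unit described above can be modelled in
software as a rough behavioural sketch. The class and function names
are hypothetical, result words are represented as integers, and it is
assumed here that latch 66 captures the locally evaluated word "cc"
when B is asserted, a detail the description does not spell out.

```python
from functools import reduce
from operator import or_


class BranchUnit:
    """Behavioural model of one branch unit 30 (illustrative only)."""

    def __init__(self):
        self.latch66 = 0  # assumed: holds the locally evaluated result word
        self.latch68 = 0  # holds a result word received from another processor

    def drive(self, cc, b):
        """AND gate array 60: gate cc with B. Returns (scc_out, en_out)."""
        if b:
            self.latch66 = cc
        return (cc if b else 0), b

    def receive(self, scc_in, en_in):
        """OR gates 62 and 64: merge the other units' outputs into latch 68."""
        if any(en_in):                        # OR gate 64 enables latch 68
            self.latch68 = reduce(or_, scc_in, 0)

    def scc(self, b):
        """Multiplexer 70: B selects latch 66, otherwise latch 68."""
        return self.latch66 if b else self.latch68


def broadcast(units, poster, cc):
    """Let units[poster] post cc; return the scc word seen by every unit."""
    outs = [u.drive(cc if k == poster else 0, k == poster)
            for k, u in enumerate(units)]
    for k, u in enumerate(units):
        others = [o for j, o in enumerate(outs) if j != k]
        u.receive([s for s, _ in others], [e for _, e in others])
    return [u.scc(k == poster) for k, u in enumerate(units)]


units = [BranchUnit() for _ in range(4)]
print(broadcast(units, 0, 0b101))   # P1 posts its compare result
print(broadcast(units, 1, 0b010))   # later, P2 posts a different result
```

Because only the posting unit drives a non-zero scc_out, the OR of the
incoming words in every other unit equals the posted word.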

The barrier unit 28 is now described in conjunction with
Figures 1, 4 and 5. The barrier unit 28 of each processor
receives a want_in signal from the barrier units of each of
the other processors and receives a mask signal M and a
region identifying signal I from the control unit 22 of
said processor. Mask signal M, which is loaded into a mask
register 72, is derived by control unit 22 from
instructions in said processor's instruction stream and
indicates which other processors are involved with said
processor in barrier synchronization. Mask signal M has a
different bit position for each other processor; one
logical state of a bit indicates that the other processor
associated with the bit position is involved or "masked in"
and the opposite logical state indicates that the
associated other processor is "masked out". The parallel
output of mask register 72 and the want_in signals from the
other barrier units are applied to a match circuit 74 which
provides a MATCH signal at its output when all other
processors which are "masked in" have issued a want_in
signal. Match circuit 74 is further described in our
aforementioned co-pending application and is easily
implemented by one of ordinary skill in the art. The match
circuit output, which is logical "1" when a match is
achieved, and the region identity signal I, which is
logical "1" or occurs when said processor is operating in a
shaded region of its instruction stream, are input to a
state machine 76.
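
The condition detected by the match circuit can be stated compactly.
In the sketch below, which is an assumption-laden illustration and not
the circuit of the co-pending application, the mask and the want_in
signals are packed into integers with one bit per other processor, and
a set mask bit is taken to mean "masked in" (the description leaves
the polarity open).

```python
def match(mask, want_in):
    """Return True when every masked-in processor has asserted want_in.

    mask:    bit k set means other processor k is "masked in" (assumed polarity).
    want_in: bit k set means other processor k is asserting want_in.
    """
    # A masked-in processor that has not asserted want_in leaves a bit set here.
    return (mask & ~want_in) == 0


print(match(0b011, 0b001))  # one masked-in processor has not yet signalled
print(match(0b011, 0b111))  # signals from masked-out processors are ignored
```

With an all-zero mask the function reports a match at once, since no
other processor is being waited for; this edge case is a consequence of
the formulation chosen here.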

State machine 76 has the state diagram of Figure 5 and
generates a want_out signal, which forms one of the want_in
signals to the barrier unit 28 of each of the other
processors, and a stall signal (indicated in the state
diagram when activated as "STALL") directed to said
processor's control unit 22 to selectively stall or idle
further instruction execution by said processor. Want_out
signals occur or are logical "1" when the processor
generating said signal has not synchronized and wants to
synchronize, said occurrence or logical "1" state being
indicated as "WANT_OUT" in the state diagram. The I signal
generated by a control unit 22 occurs or is logical "1"
when the processor is operating in a shaded region and is
logical "0" when the processor is either stalled at the end
of the shaded region or is operating in an unshaded region.

As shown in the state diagram of Figure 5,
where the possible input states and their complements
are indicated as "MATCH", "MATCH*",
"I" and "I*", state machine 76 has four states as
follows:

State "0" where, synchronization having occurred,
the processor operates in an unshaded region;

State "1" where the processor operates in a 
shaded region waiting to synchronize; 

State "2" where the processor operates in a 
shaded region and has already synchronized; and 

State "3" where the processor is stalled at 
the end of a shaded region waiting to synchronize. 

In state 0, state machine 76 will neither issue WANT_OUT
nor STALL. It will remain in state 0 as long as I*
(unshaded region) is true. When the shaded region is
encountered, I becomes true, causing either transition 78 to
state 2 or transition 80 to state 1 depending upon whether
MATCH or MATCH* is true. In transition 78, WANT_OUT is
maintained during the transition and discontinued or reset
at state 2, indicating the occurrence of synchronization.
In transition 80, WANT_OUT is maintained both during the
transition and on reaching state 1, indicating that the
processor is in the shaded region and wants to synchronize.
State machine 76 remains in state 1 as long as I is true
and MATCH* is true. If MATCH is true before the end of the
shaded region is reached, transition 82 will be taken from
state 1 to state 2, wherein WANT_OUT is maintained during
the transition and reset when state 2 is reached in view of
the occurrence of synchronization.

State 2 is maintained as long as I is true, but
when the unshaded region is encountered I* becomes
true and the transition 84 from state 2 to
state 0 is taken.

Furthermore, if MATCH and P become true 



simultaneously, corresponding to a match occur- 
ring precisely upon reaching the end of a shaded 
region, the direct transition 86 from state 1 to state 
0 is taken. WANT_OUT is maintained during that 

6 transition and reset on reaching state 0. 

If however, while in state 1 , the end of the shaded 
region is encountered prior to the occurrence of 
synchronization (P true and MATCH* true) , the 
transition 88 to state 3 is taken. During this transi- 

70 tion WANT_OUT is maintained while STALL is 
activated, which conditions are maintained during 
state 3. 

As long as MATCH* is true, state machine 76
remains in state 3. Once MATCH is true, transition
90 is taken from state 3 to state 0 where the
processor may now proceed to the next unshaded
region. WANT_OUT is maintained during this tran-
sition and reset upon reaching state 0.
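
The transitions described above can be summarized in a short simulation. The following is only a minimal sketch under stated assumptions, not the disclosed hardware: the state names, the `step` and `run` functions, and the cycle-by-cycle model are hypothetical, signal P is taken to be the complement of signal I, MATCH* the complement of MATCH, and STALL is assumed to be released during transition 90, which the text does not state explicitly.

```python
from enum import IntEnum


class State(IntEnum):
    # Descriptive names are illustrative; the text only numbers the states.
    UNSHADED = 0   # executing in an unshaded region
    WAITING = 1    # in a shaded region, synchronization not yet seen
    SYNCED = 2     # in a shaded region, synchronization has occurred
    STALLED = 3    # past the end of the shaded region, still unsynchronized


def step(state, shaded, match):
    """Advance one cycle; return (next_state, want_out, stall).

    shaded: True for signal I (shaded region), False for signal P.
    match:  True for MATCH, False for MATCH*.
    The two outputs are the signals asserted during the cycle or transition.
    """
    if state == State.UNSHADED:
        if not shaded:
            return State.UNSHADED, False, False
        # Transitions 78 (MATCH) and 80 (MATCH*) both assert WANT_OUT.
        return (State.SYNCED if match else State.WAITING), True, False
    if state == State.WAITING:
        if shaded:
            # Stay in state 1, or take transition 82 once MATCH is seen.
            return (State.SYNCED if match else State.WAITING), True, False
        if match:
            return State.UNSHADED, True, False   # transition 86
        return State.STALLED, True, True         # transition 88
    if state == State.SYNCED:
        # Transition 84 is taken when the unshaded region begins.
        return (State.SYNCED if shaded else State.UNSHADED), False, False
    # State 3: hold WANT_OUT and STALL until MATCH, then take transition 90.
    if match:
        return State.UNSHADED, True, False  # assumed: STALL released here
    return State.STALLED, True, True


def run(inputs, state=State.UNSHADED):
    """Feed (shaded, match) pairs through step() and collect the results."""
    trace = []
    for shaded, match in inputs:
        state, want_out, stall = step(state, shaded, match)
        trace.append((int(state), want_out, stall))
    return trace
```

For instance, a processor that passes through its whole shaded region without seeing MATCH visits states 0, 1, 1, 3, 3, 0 for the inputs `[(False, False), (True, False), (True, False), (False, False), (False, False), (False, True)]`, with STALL asserted only while state 3 is entered or held.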

It should now be apparent that barrier unit 28
provides the necessary synchronization in conjunc-
tion with the branch unit 30 to achieve collective
branching. While the invention has been described
in specific detail it should be appreciated that nu-
merous modifications, additions and/or omissions in
these details are possible within the intended spirit
and scope of the invention. For example, branch
unit 28 and barrier unit 30 could be more closely
integrated into a single unit and/or the resulting hard-
ware could be simplified by requiring that all
CMPSP instructions be scheduled on the same
processor.

Claims 

1. A digital processing apparatus comprising a
plurality of respective, cooperating processors for
executing in parallel respective streams of instruc-
tions, means being provided for effectuating a col-
lective branching of execution by the processors,
characterized in that:

- at least a first processor of said plurality com-
prises a first branching means for executing a
special compare instruction in a respective first
stream of instructions for thereupon producing a
special compare result;

- and at least a further processor of said plurality is
coupled with the first processor for being control-
lable through said special compare result and com-
prises a second branching means for in depen-
dence on the special compare result executing an
associated special jump instruction in a respective
further stream of instructions.

2. A digital processing apparatus as claimed in
Claim 1, characterized in that the first processor
includes third branching means for in dependence
on the special compare result executing a regular
jump instruction in the first stream of instructions,
subsequent to said special compare instruction.

3. A digital processing apparatus as claimed in 
Claim 1 or 2, characterized in that a synchronizing 
means is included for preventing the further pro- 
cessor from executing the special jump instruction 
before the associated special compare result has 
been made available by the first processor. 

4. A digital processing apparatus as claimed in 
Claim 3, characterized in that the synchronizing 
means includes a communication means coupled 
between the first processor and the further proces- 
sor for communicating a synchronization signal to 
the further processor for synchronizing with the first 
processor. 

5. A digital processing apparatus as claimed in 
Claim 1 or 2, characterized in that: 

- the further processor comprises an associated 
first branching means for executing a special com- 
pare instruction in the respective further stream of 
instructions for thereupon producing an associated 
special compare result; 

- and the first processor is coupled to the further 
processor for being controllable through said asso- 
ciated special compare result and comprises an 
associated second branching means for in depen- 
dence on the special compare result executing an 
associated special jump instruction in the first 
stream of instructions. 

6. The apparatus of Claim 3 or 4 wherein said 
first and further streams of instructions each may 
comprise a related sequence of three regions: 

a first unshaded region, a shaded region, and a 
second unshaded region, said special compare in- 
struction being in the first unshaded region of the 
first stream and the special jump instruction being 
in the second unshaded region of the further 
stream and wherein said synchronizing means 
comprises identifying means for identifying shaded 
and unshaded regions in the further instruction 
stream and stalling means for stalling execution of 
said further processor at the end of said shaded 
region if said first processor has not yet entered 
the shaded region. 

7. A digital processing apparatus comprising a
plurality of respective, cooperating processors for
executing in parallel respective streams of instruc-
tions, means being provided for effectuating a col-
lective branching of execution by the processors,
characterized in that:

- the streams of instructions each are dividable into
mutually corresponding and alternately shaded and
unshaded regions, and which streams include col-
lective branching instructions including a special
compare instruction in a first unshaded region in
one of said streams and a related special jump
instruction in a second unshaded region in other of
said streams,

- each processor including a respective first
branching means for executing a special compare
instruction in its associated stream of instructions,
for thereupon producing a special compare result;

- each processor including a respective second
branching means for in dependence on the special
compare result for another processor executing a
special jump instruction in its associated stream of
instructions;

- synchronizing means being provided for region-
wise synchronizing said plurality of processors to
ensure that instructions in said second unshaded
region are not executed until all processors have at
least reached their respective shaded regions.

8. A digital processing apparatus as claimed in
Claim 7, characterized in that the synchronizing
means include:

- identifying means for identifying for each proces-
sor whether the processor is currently operating in
a shaded region or an unshaded region and for
thereupon generating a respective identification
signal associated with each respective processor;

- reception means in each processor for receiving
a want-in signal from at least another processor for
synchronizing with the other processor;

- controlling means in each processor responsive
to said associated identification signal and to said
want-in signal for generating a want-out signal for
transmission to the other processor and for stalling
execution of the stream of instructions associated
with the processor.

9. A digital processing apparatus as claimed in
Claim 8, characterized in that the synchronizing
means includes for each processor a state machine
that can assume one of the following states:

a) a first state during which the processor
executes an instruction in an unshaded region;

b) a second state during which the processor
executes an instruction in a shaded region while
waiting for other processors to reach their respec-
tive shaded regions;

c) a third state during which the processor
executes an instruction in a shaded region when
the other processors have reached their respective
shaded regions; and

d) a fourth state during which the processor
stalls, having reached an end of a shaded region,
and waits for the other processors to reach their
respective shaded regions.

10. A method for compiling serial instruction
code including a branching instruction character-
ized in that the branching instruction comprises a
compare instruction and a related jump instruction
for scheduling related execution in parallel streams
of instructions, the method comprising:

separating said serial instruction code into streams
including first scheduling the compare instruction in
one of the streams and second scheduling a re-
lated jump instruction in the other of the streams.

11. The method of Claim 10 characterized in
that the method further comprises third scheduling
further instructions in said streams between said
compare instruction and said jump instruction.

12. The method of Claim 11 characterized in
that said third scheduling further comprises re-
ordering the instructions of said streams to place
said further instructions between said compare
instructions and said related jump instruction.

13. The method of Claim 12 further comprising
establishing shaded and unshaded regions in said
streams by placing said compare instruction and
said jump instruction in different unshaded regions
separated by a shaded region and wherein said
reordering step comprises moving into the shaded
region instructions occurring after the shaded re-
gion which will be executed earlier.

SOURCE CODE

SEPARATE INTO STREAM CODE
PUTTING CMPSP IN ONE STREAM AND
JMPSP IN OTHER INVOLVED STREAMS

ESTABLISH CORRESPONDING SHADED
AND UNSHADED REGIONS WITH
CMPSP AND JMPSP IN
DIFFERENT UNSHADED REGIONS

REORDER STREAM CODE TO
MAXIMIZE SHADED REGION
BETWEEN CMPSP AND JMPSP
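
The region constraint stated above, with CMPSP and JMPSP in different unshaded regions separated by a shaded region, can be illustrated by a small checker. This is a hedged sketch only and not the disclosed compiling method: the representation of a stream as a list of `(opcode, shaded)` pairs, the function names `region_of` and `layout_ok`, and the assumption that every stream begins in an unshaded region (so that region numbers correspond across streams) are all hypothetical.

```python
from itertools import groupby


def region_of(stream, opcode):
    """Return (region number, shaded flag) of the first `opcode` in a stream.

    `stream` is a list of (opcode, shaded) pairs. A region is a maximal run
    of consecutive instructions sharing the same shaded flag, so shaded and
    unshaded regions alternate by construction.
    """
    runs = groupby(stream, key=lambda instruction: instruction[1])
    for number, (shaded, run) in enumerate(runs):
        if any(op == opcode for op, _ in run):
            return number, shaded
    return None


def layout_ok(cmp_stream, jmp_streams):
    """Check that CMPSP and every JMPSP sit in unshaded regions two apart."""
    cmp_pos = region_of(cmp_stream, "CMPSP")
    if cmp_pos is None or cmp_pos[1]:
        return False  # CMPSP missing, or placed in a shaded region
    for stream in jmp_streams:
        jmp_pos = region_of(stream, "JMPSP")
        if jmp_pos is None or jmp_pos[1]:
            return False  # JMPSP missing, or placed in a shaded region
        # Two regions later means exactly one shaded region lies between.
        if jmp_pos[0] != cmp_pos[0] + 2:
            return False
    return True
```

As a usage example, a stream `[("CMPSP", False), ("ADD", True), ("JMP", False)]` paired with `[("LOAD", False), ("MUL", True), ("JMPSP", False)]` satisfies the check, whereas a second stream that places JMPSP in its first unshaded region does not. The opcodes other than CMPSP and JMPSP are placeholders.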